5 additional data blending examples for smarter SEO insights

Once you preprocess columns into a consistent format, additional data blending options include prioritizing pages with search clicks, mining internal site search for content gaps, analyzing traffic issues with 404 pages and more.

When you start shifting your SEO work to be more data-driven, you will naturally look at all the data sources in your hands and might find it challenging to come up with new data blending ideas. Here is a simple shortcut that I often use: I don’t start with the data sources I have (bottom-up), but with the questions I need to answer, and then I compile the data I need (top-down).

In this article, we will explore 5 additional SEO questions that we can answer with data blending, but before we dive in, I want to address some of the challenges you will face when putting this technique into practice.

Tony McCreath raised a very important frustration you can experience when data blending:

When you join separate datasets, the common columns need to be formatted the same way for this technique to work. However, that is rarely the case, so you often need to preprocess the columns ahead of the join operation.

It is relatively easy to perform advanced data joins in Tableau, Power BI and similar business intelligence tools, but preprocessing the columns is where learning a little bit of Python pays off.

Here are some of the most common preprocessing issues you will often see and how you can address them in Python.

URLs

Absolute or relative. You will often find a mix of absolute and relative URLs. For example, Google Analytics URLs are relative, while URLs from SEO spider crawls are absolute. You can convert them all to either relative or absolute form.

Here is how to convert relative URLs to absolute:

#convert relative URLs to absolute
from urllib.parse import urljoin
#relative 404 URLs from Search Console API: webmasters.urlcrawlerrorssamples.list
pageUrl = "product/mad-for-plaid-flannel-dress" #missing forward slash
print(urljoin("https://www.example.com/", pageUrl))
#Output -> https://www.example.com/product/mad-for-plaid-flannel-dress
#relative links from Google Analytics API: ga:landingPagePath
landingPagePath = "/about-chc/clinicians/audiology-technology/" #including forward slash
print(urljoin("https://www.example.com/", landingPagePath))
#Output -> https://www.example.com/about-chc/clinicians/audiology-technology/

Here is how to convert absolute URLs to relative:

#convert absolute URLs to relative
from urllib.parse import urlsplit, urlunsplit
#Absolute source URLs linking to 404s from Search Console API: webmasters.urlcrawlerrorssamples.list
linkedFromUrls = [
    "http://www.example.com/brand/swirly/shopby?sizecode=99",
    "https://www.example.com/brand/swirly"
]
#first break the URL into parts
u = urlsplit(linkedFromUrls[0])
# u -> SplitResult(scheme='http', netloc='www.example.com', path='/brand/swirly/shopby', query='sizecode=99', fragment='')
#then rebuild it with an empty scheme, netloc and fragment
relative_url = urlunsplit(("", "", u.path, u.query, ""))
print(relative_url)
#Output -> /brand/swirly/shopby?sizecode=99

Case sensitivity. Most URLs are case sensitive, but if the site is hosted on a Windows server, you will often find URLs with different capitalization that return the same content. You can convert them all to lowercase or uppercase.

Here is how to convert them to lowercase:

#convert URL to lowercase
crawled_url = "https://www.example.com/ABOUT-chc/clinicians/audiology-technology/"
print(crawled_url.lower())
#Output -> https://www.example.com/about-chc/clinicians/audiology-technology/

Here is how to convert them to uppercase:

#convert URL to uppercase
crawled_url = "https://www.example.com/ABOUT-chc/clinicians/audiology-technology/"
print(crawled_url.upper())
#Output -> HTTPS://WWW.EXAMPLE.COM/ABOUT-CHC/CLINICIANS/AUDIOLOGY-TECHNOLOGY/

Encoding. Sometimes URLs come from the URL parameter of another source URL, and if they have query strings they will be URL encoded. When you extract the parameter value, the library you use might or might not decode it for you.

Here is how to decode URL-encoded URLs:

#decode URL-encoded URL
from urllib.parse import urlsplit, parse_qs, unquote
url_source = "/url?sa=t&source=web&rct=j&url=https://support.google.com/webmasters/answer/35291%3Fhl%3Den&ved=2ahUKEwi42-aIwP3gAhUNON8KHf4EB-QQFjAIegQIChAB"
u = urlsplit(url_source)
print(u.query)
#Output -> 'sa=t&source=web&rct=j&url=https://support.google.com/webmasters/answer/35291%3Fhl%3Den&ved=2ahUKEwi42-aIwP3gAhUNON8KHf4EB-QQFjAIegQIChAB'
#note the parameter 'url' is URL encoded because it includes a query string
url_params = parse_qs(u.query)
print(url_params["url"])
#Output -> ['https://support.google.com/webmasters/answer/35291?hl=en']
#in case the URL is already encoded
encoded_url = "https://support.google.com/webmasters/answer/35291%3Fhl%3Den"
print(unquote(encoded_url))
#Output -> 'https://support.google.com/webmasters/answer/35291?hl=en'

Parameter handling. If the URLs have more than one URL parameter, you can face some of these issues:

  1. You might have parameters with no values.
  2. You might have redundant/unnecessary parameters.
  3. You might have parameters ordered differently.

Here is how we can address each one of these issues.

#Handling URL parameters
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def clean_url_params(url):
    print(url)
    u = urlsplit(url)
    print(u.query)
    #example output -> 'sizecode=99&sort='
    #parse_qsl drops parameters with no values, such as sort=
    url_params = parse_qsl(u.query)
    print(url_params)
    #example output -> [('sizecode', '99')]
    #next let's sort the parameters so they are always in the same order
    url_params.sort(key=lambda tup: tup[0]) #sorts in place by parameter name
    #now we need to rebuild the URL
    new_query = urlencode(url_params)
    print(new_query)
    #example output -> 'sizecode=99'
    new_url = urlunsplit((u.scheme, u.netloc, u.path, new_query, ""))
    print(new_url)
    #example output -> 'http://www.example.com/brand/swirly/shopby?sizecode=99'

#Absolute source URLs linking to 404s from Search Console API: webmasters.urlcrawlerrorssamples.list
linkedFromUrls = [
    "http://www.example.com/brand/swirly/shopby?sizecode=99&sort=",
    "https://www.example.com/brand/swirly?sort=asc&sizecode=99",
    "https://www.example.com/brand/swirly?sizecode=99&sort=asc",
]
#You might have parameters with no values. For example, sort=
clean_url_params(linkedFromUrls[0])
#Output: http://www.example.com/brand/swirly/shopby?sizecode=99&sort= -> http://www.example.com/brand/swirly/shopby?sizecode=99
#You might have parameters ordered differently
clean_url_params(linkedFromUrls[1])
clean_url_params(linkedFromUrls[2])
#Output:
# https://www.example.com/brand/swirly?sort=asc&sizecode=99 -> https://www.example.com/brand/swirly?sizecode=99&sort=asc
# https://www.example.com/brand/swirly?sizecode=99&sort=asc -> https://www.example.com/brand/swirly?sizecode=99&sort=asc

Dates

Dates can come in many different formats. The main strategy is to parse them from their source format into Python datetime objects. You can optionally manipulate the datetime objects, for example, to sort dates correctly or to localize them to a specific time zone. Most importantly, you can then format all of the dates using a single, consistent convention.

Here are some examples:

#Reformatting date strings
#Crawled and first detected dates from the Search Console API: webmasters.urlcrawlerrorssamples.list
last_crawled = "2019-01-12T04:00:59.000Z" #ISO-8601 date
first_detected = "2018-11-19T02:59:25.000Z"
from datetime import datetime
#Here is how to parse dates the hard way. See https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
dt = datetime.strptime(last_crawled, "%Y-%m-%dT%H:%M:%S.%fZ") #you need to create a precise date format string
print(repr(dt))
#output -> datetime.datetime(2019, 1, 12, 4, 0, 59)
#Here is how to do this the easy way using https://pypi.org/project/python-dateutil/
import dateutil.parser
dt = dateutil.parser.parse(last_crawled)
print(repr(dt))
#output -> datetime.datetime(2019, 1, 12, 4, 0, 59, tzinfo=tzutc())
#Google Analytics API ga:date
ga_date = "20190311"
dt = dateutil.parser.parse(ga_date)
print(repr(dt))
#output -> datetime.datetime(2019, 3, 11, 0, 0)
#Latest links crawled date
last_links_crawled = "Mar 4, 2019"
dt = dateutil.parser.parse(last_links_crawled)
print(repr(dt))
#output -> datetime.datetime(2019, 3, 4, 0, 0)
#finally we can format any of these datetime objects using a consistent convention
dt = dateutil.parser.parse(last_crawled)
print(dt.isoformat())
#output -> '2019-01-12T04:00:59+00:00'

Keywords

Correctly matching keywords across different datasets can also be a challenge. You need to review the columns to see whether the keywords appear as users would type them or whether they have been normalized.

For example, it is not uncommon for users to search by copying and pasting text. These searches can include hyphens, quotes, trademark symbols and other characters that would not normally appear when typed. And when users do type, spacing and capitalization might be inconsistent from one user to the next.

In order to normalize keywords, you need to at least remove unnecessary characters and symbols, collapse extra spacing and standardize on lowercase (or uppercase).

Here is how you would do that in Python:

#standardize keywords by removing extra characters and spacing, URL decoding and lowercasing
import re
from urllib.parse import unquote_plus

def normalize_keywords(keyword):
    bad_chars = "'\"®" #add more as needed
    table = str.maketrans(dict.fromkeys(bad_chars))
    #if URL encoded, decode
    keyword = unquote_plus(keyword)
    #lowercase keyword
    keyword = keyword.lower()
    #remove extra whitespace
    keyword = re.sub(r"\s+", " ", keyword)
    #remove bad chars
    keyword = keyword.translate(table)
    return keyword

#Here are some examples
#Check the dimensions and metrics explorer https://developers.google.com/analytics/devguides/reporting/core/dimsmets
#ga:keyword -> utm_term tracking parameter (manual campaign tracking) with quotes
ga_keyword = "\"solo female traveling\""
normalize_keywords(ga_keyword)
#Returns -> 'solo female traveling'
#ga:searchKeyword -> Internal Search Term with special symbol
ga_search_keyword = "10 days in Disneyland®"
normalize_keywords(ga_search_keyword)
#Returns -> '10 days in disneyland'
#ga:searchKeywordRefinement -> Subsequent internal search term with capitalized words and extra white space
ga_search_keyword_refinement = "10 day music cities  NASHVILLE to NEW ORLEANS"
normalize_keywords(ga_search_keyword_refinement)
#Returns -> '10 day music cities nashville to new orleans'
#ga:adMatchedQuery -> Search query that triggered impressions -> already normalized
ga_ad_matched_query = "travel for single women"
normalize_keywords(ga_ad_matched_query)
#Returns -> 'travel for single women'
#search_url -> Google Search -> copied and pasted text, URL encoded
from_search_url = "DA+doesn't+influence+your+Google+rankings"
normalize_keywords(from_search_url)
#Returns -> 'da doesnt influence your google rankings'

Now that we know how to preprocess columns, let’s get to the fun part of the article and review some additional SEO data blending examples:

Error pages with search clicks

You have a massive list of 404 errors that you pulled from your web server logs because Google Search Console doesn’t make it easy to get the full list. Now you need to redirect most of them to recover the lost traffic. One approach you could use is to prioritize the pages with search clicks, starting with the most popular ones!

Here is the data you’ll need:

Google Search Console: page, clicks

Web server log: HTTP request, status code = 404

Common columns (for the merge function): left_on: page, right_on: HTTP request.
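
To make the join concrete, here is a minimal pandas sketch of this blend. The two DataFrames below are hypothetical stand-ins for your Search Console export and your parsed server log; in practice you would load them with pandas.read_csv (or the respective APIs) after preprocessing the URL columns as described above.

#blend Search Console clicks with server log 404s (hypothetical example data)
import pandas as pd
gsc = pd.DataFrame({
    "page": ["/brand/swirly", "/brand/plaid", "/about-chc/"],
    "clicks": [120, 45, 300],
})
log = pd.DataFrame({
    "request": ["/brand/swirly", "/brand/plaid", "/old-page"],
    "status_code": [404, 404, 404],
})
#inner join keeps only the 404 requests that also received search clicks
merged = pd.merge(gsc, log[log["status_code"] == 404],
                  left_on="page", right_on="request", how="inner")
#prioritize redirects by popularity
print(merged.sort_values("clicks", ascending=False)[["page", "clicks"]])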

Pages missing Google Analytics tracking code

Some sites choose to insert tracking codes manually instead of placing them in web page templates. This can lead to traffic underreporting due to pages missing the tracking code. You could crawl the site to find such pages, but what if the pages are not linked from within the site? One approach you could use is to compare the pages in Google Analytics and Google Search Console over the same time period. Any page in the GSC dataset but missing from the GA dataset could potentially be missing the GA tracking script.

Here is the data you’ll need:

Google Search Console: date, page

Google Analytics: ga:date, ga:landingPagePath, filtered to Google organic searches.

Common columns (for the merge function): left_on: page, right_on: ga:landingPagePath.
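
Here is a minimal pandas sketch of this comparison, again with hypothetical example data. A left join with an indicator column makes it easy to isolate pages that only appear in the Search Console set.

#find pages present in Search Console but absent from Google Analytics (hypothetical example data)
import pandas as pd
gsc = pd.DataFrame({"page": ["/a/", "/b/", "/c/"]})
ga = pd.DataFrame({"ga:landingPagePath": ["/a/", "/c/"]})
#left join with indicator=True adds a _merge column flagging rows found only in the left set
merged = pd.merge(gsc, ga, left_on="page", right_on="ga:landingPagePath",
                  how="left", indicator=True)
missing_tracking = merged[merged["_merge"] == "left_only"]["page"]
print(missing_tracking)
#Output -> /b/ (a candidate page missing the GA tracking script)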

Excluding 404 pages from Google Analytics reports

One disadvantage of inserting tracking codes in templates is that Google Analytics page views can fire even when users land on 404 pages. This is generally not a problem, but it can complicate your life when you are trying to analyze traffic issues and can’t tell which traffic is good and ends on actual page content, and which is bad and ends in errors. One approach you could use is to compare the pages in Google Analytics with the pages crawled from the website that return a 200 status code.

Here is the data you’ll need:

Website crawl: URL, status code = 200

Google Analytics: ga:landingPagePath

Common columns (for the merge function): left_on: URL, right_on: ga:landingPagePath
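
A minimal pandas sketch of this filter, using hypothetical crawl and GA exports, could look like the following. The inner join keeps only the landing pages that the crawler confirmed return a 200.

#keep only GA landing pages confirmed to return a 200 in the crawl (hypothetical example data)
import pandas as pd
crawl = pd.DataFrame({"URL": ["/a/", "/b/"], "status_code": [200, 200]})
ga = pd.DataFrame({"ga:landingPagePath": ["/a/", "/b/", "/missing-page"],
                   "sessions": [500, 80, 12]})
#inner join drops GA rows without a matching 200 URL in the crawl
merged = pd.merge(ga, crawl, left_on="ga:landingPagePath", right_on="URL", how="inner")
print(merged[["ga:landingPagePath", "sessions"]])
#the /missing-page row (likely a 404) is excluded from the report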

Mining internal site search for content gaps

Let’s say that you review your internal site search reports in Google Analytics and find people coming from organic search and still performing one or more internal searches until they find their content. It might be the case that content pieces are missing that could drive those visitors directly from organic search. One approach you could use is to compare your internal search keywords with the keywords from Google Search Console. The two datasets should use the same date range.

Here is the data you’ll need:

Google Analytics: ga:date, ga:searchKeyword, filtered to Google organic search.

Google Search Console: date, keyword

Common columns (for the merge function): left_on: ga:searchKeyword, right_on: keyword
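
Here is a minimal pandas sketch of this comparison with hypothetical example data. After normalizing both keyword columns (for example, with the normalize_keywords function above), a left join with an indicator column surfaces internal search terms that never appear as Search Console queries.

#internal search terms that don't appear as Search Console queries (hypothetical example data)
import pandas as pd
ga_search = pd.DataFrame({"ga:searchKeyword": ["solo female traveling", "10 days in disneyland"]})
gsc = pd.DataFrame({"keyword": ["solo female traveling"]})
#normalize both keyword columns first, then anti-join via the indicator column
merged = pd.merge(ga_search, gsc, left_on="ga:searchKeyword", right_on="keyword",
                  how="left", indicator=True)
content_gaps = merged[merged["_merge"] == "left_only"]["ga:searchKeyword"]
print(content_gaps)
#Output -> 10 days in disneyland (a potential content gap)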

Checking Google Shopping organic search performance

Google announced last month that products listed in Google Shopping feeds can now show up in organic search results. I think it would be useful to check how much traffic you get from them versus your regular organic listings. If you add additional tracking parameters to the URLs in your feed, you can use Google Search Console data to compare the same products appearing in regular listings versus organic shopping listings.

Here is the data you’ll need:

Google Search Console: date, page, filtered to pages with the shopping tracking parameter

Google Search Console: date, page, filtered to pages without the shopping tracking parameter

Common columns (for the merge function): left_on: page, right_on: page
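
Assuming you strip the tracking parameter before joining (the clean_url_params function above is one way to do it), a minimal pandas sketch of the comparison with hypothetical example data could look like this.

#compare shopping-tagged vs regular organic clicks per product page (hypothetical example data)
import pandas as pd
shopping = pd.DataFrame({"page": ["/product/dress", "/product/shoes"],
                         "clicks": [30, 12]})
regular = pd.DataFrame({"page": ["/product/dress", "/product/shoes"],
                        "clicks": [210, 95]})
#after removing the tracking parameter, the page columns match and we can join on them
merged = pd.merge(shopping, regular, on="page", suffixes=("_shopping", "_regular"))
merged["shopping_share"] = merged["clicks_shopping"] / (
    merged["clicks_shopping"] + merged["clicks_regular"])
print(merged)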